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Abstract 

The distribution of genetic polymorphisms in a population contains information 
about the mutation rate and the strength of natural selection at a locus. Here, we 
show that the Poisson Random Field (PRF) method of population-genetic inference 
suffers from systematic biases that tend to underestimate selection pressures and 
mutation rates, and that erroneously infer positive selection. These problems arise 
from the infinite-sites approximation inherent in the PRF method. We introduce 
three new inference techniques that correct these problems. We present a finite- 
site modification of the PRF method, as well as two new methods for inferring 
selection pressures and mutation rates based on diffusion models. Our methods 
can be used to infer not only a "weighted average" of selection pressures acting on 
a gene sequence, but also the distribution of selection pressures across sites. We 
evaluate the accuracy of our methods, as well that of the original PRF approach, 
by comparison with Wright-Fisher simulations. 
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INTRODUCTION 



The mutation rate and selection pressures operating on genes are of central importance in 
shaping their evolution. The number and frequency distribution of genetic polymorphisms 
within a population carry information about these fundamental processes. Polymorphisms 
at higher frequencies reflect weaker selective pressures (or positive selection), and vice versa. 
Similarly, a larger number of polymorphisms indicates a higher mutation rate. Thus we can 
use the polymorphism frequency spectrum observed in genetic sequences sampled from a 
population in order to infer the mutation rate and the strength and direction of selection 
acting on the sequence. 

This intuition can be formalized into a rigorous method for estimating selection pressures 
and mutation rates by calculating the likelihood of sampled polymorphism data as a function 
of these parameters. The Poisson Random Field (PRF) model provides an important and 
widely-used method of doing so. The PRF model assumes a panmictic population of con- 
stant size, free recombination, infinite sites, no d ominance or epistasis, and e qual selection 



pressures at all sites. Under these assumptions, ISawyer and HartlI (I1992| ) showed that 
the distribution of frequencies of mutant lineages in a population forms a Po isson random 
field w hose properties depend on the se lection pressure and the mutation rate. 
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mm and 



BUSTAMANTE et al 



(I2OOII ) developed a maximum likelihood method of estimat- 



ing these parameters from data on the polymorphism frequency spectrum. T 



lis method has 



been widely used to study , for example, purifying se 
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Closely related to these analyses of polymorphism data are methods that calculate, based 
on the PRF model, the ratio of the expected number of polymorphisms within species 
to divergence between species for syn onymous and nonsynonymous sites (using the idea 
behind the McDonald-Kreitman test (IMcDONALD and Kreitmani . Il99ll )). These meth- 
ods discard some of the available data, as they depend only on the number of polymor- 



phisms and not their full 



assumptions (ILoewe et al. 



r equency spectrum. However, the y are also less sensitive to 
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applied to estimate selection pressures on synon ymous variation ( 



onymous mutations in mitocho ndrial genomes (INachman 
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Much recent theoretical work has focused on relaxing vario us assumptions of t h e orig 



inal PRF method. T hese include allow ing for dominance (IWilliamson et al 



20041) . 



popu lation subdivision (IWakeleyI . l2003f). changing popula t ion si ze (IWilliamson et al. 



20051 ). and linkage between sites ( IZhu and BustamanteI . l2005l ). Several methods for 



studying the properties of the distribution of selection pressures across sites based on 



the P RF model have also 
trumj 



j een d eveloped, both using the polymorphism frequency spec- 
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All of the methods summarized above fall within the PRF framework, and therefore de- 
pend on the infinite-sites approximation. Rather than calculating the evolutionary dynamics 
at each site, these methods consider the overall steady-state distribution of mutant lineage 
frequencies across all sites. The PRF method is applied to data by assuming that each 
lineage segregates at a different site. The infinite-sites assumption is made for purely tech- 
nical, as opposed to biological, reasons. In this paper, we show that in biologically relevant 
parameter regimes the infinite-sites assumption causes the PRF method to underestimate 
selection pressures and mutation rates. This problem arises both for inferences based on the 
polymorphism frequency spectrum and for inferences based on the ratio of within-species 
polymorphism to between-species divergence, but in this paper we focus exclusively on the 
former. As we demonstrate below, the PRF method often underestimates the selection pres- 
sure and the mutation rate by as much as an order of magnitude. In addition, and perhaps 
of greater concern, the PRF method frequently infers that a gene is under strong positive 
selection when in fact the gene is experiencing weak negative selection (Fig. 1). 

In this paper, we present three methods to correct the systematic biases of the PRF 
method, each with their own advantages and drawbacks. Rather than study mutant lineages 
across a sequence, our methods all focus on explicit models of the evolutionary dynamics 
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at individual sites. We first present a modification of the PRF method that calculates the 
frequency distribution of mutant lineages at each site, rather than across the whole sequence. 
We next present two new methods based on well-known diffusion equations in place of the 
PRF framework. All three of our methods allow us to estimate the selection pressure and 
the mutation rate from data on the polymorphism frequency spectrum. In addition, these 
methods also allow us to infer the distribution of selection pressures across sites. In order 
to assess the accuracy of these methods, we generate polymorphism data from simulated 
Wright-Fisher populations with known selection pressures and mutation rates. By comparing 
inferences drawn from these simulated data sets, we demonstrate that our methods correct 
the biases inherent in the original PRF approach. 

THE POISSON RANDOM FIELD MODEL OF POLYMORPHISMS 



We begin by outlining the Poisson R andom Field (PRF ) model of the site-frequenc y 



spectrum developed by Sawyer and Hartl (IHartl et al. 



1994 



Sawyer and Hartl 



This model assumes that mutations occur in a population of effective size at a Poisson rate 
Nu, where u is the per-sequence mutation rate, and are all subject to selection of strength s. 
The fate of each mutant lineage is modeled by a diffusion approximation to the processes of 
selection and drift. When a new mutant lineage enters the population, it is assumed to arise 
at a site that has not previously experienced any mutations (the infinite-sites assumption). 
Each mutant lineage is assumed to be independent of all others (the free-recombination 
assumption). Since new mutations are continuously arising, and older mutant lineages are 
fixing or dying out, there is a steady-state distribution of the frequencies of segregating 
(i.e. non-fixed and non-extinct) mutant lineages. The expected number of segregating sites 
increases linearly with the mutation rate, but u does not affect the shape of the steady-state 
distribution of segregating mu tant freq u encies . 



Extending earlier work by 



MORAN 



Sawyer and Hartl 



( 119591 ) and IWRlGHli f ll988l ) 
( I1992I ) calculated this steady state distribution of lineage frequencies. They found that the 
number of lineages with frequency between x and x + dx is Poisson distributed with mean 
f{x)dx, where 



-27(l-x) I 

e-27 x(l — x) 



(1) 
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Here 7 = Ns is a measure of the strength of selection on the mutant hneages and 61 = 2Nu 
is twice the population per-sequence mutation rate. The function /(x) is referred to as 
the mean of a Poisson Random Field. In other words, the number of mutant lineages with 
frequency between xi and X2 is a Poisson random variable with mean J^^ f{x)dx. In addi- 
tion, the number of mutant lineages with frequency in [xi,X2] is independent (as a random 
variable) from the number of mutant lineages with frequency in [yi,y2], provided these in- 
tervals do not intersect. Note that f{x) is not integrable at or 1. This divergence arises 
because the steady state is due to a balance between new mutations constantly occurring 
and older lineages fixing or going extinct. Thus there is no finite, steady-state expression 
fo r the number o f lineages th at have fixed or gone extinct. 



Hartl et al 



(119941 ) and IBustamante et all (120011 ) used Eq. ([T]) as the basis for 



maximum-likelihood (ML) estimation of the mutation rate 61 and selection pressure 7 from 
polymorphism data. They imagined sampling and sequencing n individuals from a popu- 
lation with this steady state distribution of segregating mutant lineages. They made the 
infinite-sites assumption that all mutant lineages occur at different sites, consistent with the 
earlier assumption that each lineage is independent (used to derive Eq. ([T])). If a given site 
has a mutant lineage at frequency x in the entire population, then the probability that a 
sample of n individuals will contain i mutant nucleotides and n — i ancestral nucleotides at 
the site is (")a;*(l — x)"~*. That is, for each mutant lineage with frequency x, there is a 
probability (")a;*(l — x)* of finding a corresponding site with i mutant nucleotides. Since 
the number of mutant lineages at frequency x in the population is Poisson distributed with 
mean f{x)dx, the number of sampled sites containing i mutant nucleotides (we refer to these 
as 2-fold mutant sites) is Poisson distributed with mean 



1 1 _ g-27(l-x) I 



n 



F{j) = ei — -[ .]x\l-xY~'dx. (2) 

This equation leads imni e diate ly to a maximum likelihood procedure for estimating 7 
and 61 (iBusTAMANTE et all I2OOII ) . A set of sequences from n sampled individuals within 
a population will contain some number, of i-fold mutant sites for < i < n. The set 
of values ?/2, • • • Vn-i is called the site-frequency spectrum of the observed data. The 
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n-1 



probability of a spectrum given 7 and ^j, is 

For an observed spectrum yi in a particular data set, we can maximize this likelihood over 
9i and 7 to estimate the mutation rate and selection pressure. 

The likelihood expression above assumes we know which nucleotide is ancestral and which 
nucleotide is the mutant at each polymorphic site. We refer to this situation as the unfolded 
case. When we do not have this information, we cannot distinguish between an z-fold mutant 
site and an (n — 'i)-fold mutant site. In this case, a data set will contain some number, yi, 
i-fold and/or {n — z)-fold mutant sites, where i runs between 1 and the largest integer less 
than or equal to n/2. We refer to this as the folded case. In this situation, the likelihood of 
a particular data set Lf{6i, 7) is given by the same expression as in L„, but with the product 
running from 1 to n/2 and with F{i) replaced by F{i) + F{n — i) (except ii i = n/2). 

Shortcomings of the Poisson Random Field model 

The PRF model makes two key assumptions: that each site is independent of all the oth- 
ers, and that two mutant lineages never segregate at the same site. The former assumption 
is equivalent to assuming free recombination between all sites which segregate concurrently. 
This assumption may be violated in many populations, particularly when the method is 
applied to estimate selective pressures on short stretches of DNA (e.g. a single gene) that 
are linked over long timescales. Ideally, we would like a theory that allows us to infer the 
strength of selection acting on a large number of sites with an arbitrary degree of linkage, but 
no such theory yet exists. Nevertheless, a free recombination model is useful as a null model 
and a limiting case, and it can be used to compare with more complicated possibilities. Such 
a model may also be used to test whether recombination is necessary to explain data from a 
particular population. Furthermore, whenever selection pressures are weak, sites segregate 
over long timescales and so recombination may be frequent enough that even intragenic sites 
are unlinked over these timescales. Finally, it is important to note that linkage between seg- 
regating sites will not bias estimates of the selection pre ssure, provided it is equal across 



sites, but rather increase t he variance in such estimates (IAkashi and SchaefferI . 



BUSTAMANTE et ol. 



2001I) 



1997 : 



The second assumption of the PRF, that there are an infinite number of sites, is more 
problematic. This problem is most apparent when considering how the PRF method treats 
"multiply polymorphic" sites — those that exhibit more than two types of segregating 
nucleotides. We refer to the configuration of a particular site as (a, b, c, d), where a, b, c, and 
d are the numbers of sampled sequences which exhibit each of the four nucleotides. When 
we have unfolded data, a is the frequency of the ancestral nucleotide and b, c, and d are 
the frequencies of the three possible mutant nucleotides, in order of decreasing frequency. 
When we have folded data, a, b, c, and d are the frequencies of all four possible nucleotides, 
again in order of decreasing frequency. In the original PRF analysis, a site with a (12, 1, 1, 0) 
configuration, for example, is treated identically to a site with a (12,2,0,0) configuration. 
Such a treatment is incorrect: the former configuration can only arise from two low- frequency 
mutant lineages, whereas the latter configuration could be caused by a single high-frequency 
lineage. Yet the PRF analysis ex cludes the first possibihty, and treats both config urations as 



if they were all (12, 2, 0, 0) sites (IBustamante et ali I2OO1I : IHartl et a/.l . Il994j ). Similarly, 



the PRF method treats (10,2,2,0), (10,3,1,0), and (10,2,1,1) sites as if they were in a 
(10,4,0,0) configuration, etc. 

The infinite sites approximation also affects sites that are not multiply polymorphic. The 
essential problem is that the sampling mechanism used to calculate F{i) under the PRF 
method implicitly assumes that each mutant lineage occurs at a different site. The number 
of z-fold mutant sites is assumed to equal the number of z-fold mutant lineages sampled. Yet 
if multiple lineages are segregating at the same site, an i-fold and a /c-fold mutant lineage 
sampled at the same site can lead to an apparently {i + A;)-fold mutant site, if the two 
lineages happen to be mutations to the same nucleotide. In other words, a (12,2,0,0) site 
could reflect two low-frequency mutant lineages or one higher-frequency one, but the PRF 
incorrectly assumes that only the latter is possible. 

The inflnite-sites assumption is problematic whenever there are multiple mutations seg- 
regating at a site, even if they are at low frequency. Since a mutant lineage will survive 
on average 0(ln[l/|s|]) generations before flxing or going extinct, and mutations arise at 
rate Nfi per site, the inflnite-sites approximation will be valid only when Nfiln [l/\s\ ^] 1. 
This condition is often violated in real populations. In fact, several estimat es of mutation 



rates and selective pressures based on the PRF method violate this condition (IHartl et al. 
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19941 ) ■ We can also see explicitly that this condition is violated whenever a data set includes 



multiply polymor phic sites. Polymorp hisms of this type are indeed observed in data analyzed 
by PRF method ( Hartl et aU . 1994 ). 

Unlike the assumption of free recombination, the assumption of infinite sites is purely 
technical. In other words, we do not aspire to use the PRF method as a null model for 
exploring whether or not real systems violate the infinite sites assumption. This assumption 
is highly problematic because it induces systematic biases in the estimates of selection and 
mutation obtained by the PRF method: the method always treats an i-fold sampled site 
as having arisen from a single mutant lineage that has reached frequency i in the sample, 
ignoring the possibility of multiple mutant lineages at lower frequencies which sum to i. 
By disregarding the latter possibilities, the PRF method systematically underestimates the 
mutation rate u and the strength of selection Even though sites that violate the infinite 
sites approximation may be very rare, they have a disproportionate weight in estimates of 
these parameters and thus can lead to errors of an order of magnitude or more (see below). 
The method also erroneously infers positive selection in many situations where selection is 
actually negative. Both of these problems arise across a broad and biologically relevant 
range of parameters (Fig. 1). 

In addition to the systematic biases in the estimates it produces, the PRF method also 
wastes data. In particular, it makes no use of the information contained in the monomorphic 
sites of a sample. Given a 7 and 61, the PRF method predicts the number of sites yi that 
will exhibit i-fold polymorphisms, for i between 1 and n, the sample size. The PRF method 
makes no prediction about the number of monomorphic sites (more precisely the infinite sites 
approximation assumes that there is an infinite set of monomorphic sites among which the 
polymorphism is found). Thus if we had two sets of sequences with the same configuration of 
polymorphic sites but different numbers of monomorphic sites, ML estimation based on the 
PRF method would infer identical (per-sequence) mutation rates and selective coefficients. 
This holds despite the fact that monomorphic sites certainly contain information: a larger 
number of monomorphic sites reflects a lower mutation rate or stronger selection or both. 

Two other assumptions of the PRF model are worth mentioning. First, the PRF method 
assumes that all sites experience the same selective pressure. In practice, this means that 
the PRF estimate of the selection strength s is actually some sort of weighted average of the 



8 



selective forces acting on the sites analyzed. For example, since any site with |s| ^ will 
almost always be monomorphic, and data on monomorphism is ignored, such sites will not 
influence the PRF ML estimate of the "average" s. But beyond this, the weighting in this 
average is unclear, and hence it is unclear what the PRF ML estimate of s really represents. 
We explore this issue of variable selective pressures in more detail below. 

Finally, the PRF framework assumes that, once a mutation fixes at a particular site, 
additional mutations at the site will experience the same selective coefficient as the original 
mutation. For example, if a mutation from nucleotide ^4 to C fixes despite a selective 
disadvantage s, new mutations at this site (e.g. from C back to A) are assumed to again 
have selective c/isadvantage s. This unrealistic assumption arises because the PRF focuses 
on a steady state distribution of mutant lineages without references to the sites at which they 
occur — the fixation of old mutants is simply assumed to balance the continuous generation 
of new ones, but not to change the selective advantage of future mutations. There is no 
reference to the evolutionary dynamics at individual sites. Although unrealistic for negative 
selection, this approach has some advantages. In particular, it makes modeling positive 
selection straightforward: positive selection is simply a constant flux of new mutations that 
increase the fitness, and a steady state distribution of their frequencies is well defined. 

A PER-SITE POISSON RANDOM FIELD MODEL OF POLYMORPHISMS 

The problems with the PRF method described above can all be resolved by replacing it 
with an explicit model of evolution at each site. We develop such a model in the following 
section. First, however, we describe in this section a method that retains the basic PRF 
framework, but corrects some of the problems associated with the infinite-sites assump- 
tion, and takes full advantage of the information provided by the frequencies of all possible 
configurations at a site. 

The basic idea behind this modified approach is to recast the PRF framework on a 
per-site basis. We describe the steady state frequency distribution of mutant lineages at 
a given site. From this, we can calculate the probability that a sample of n individuals 
will contain any configuration of mutants at that site. As in the original PRF method, we 
retain the assumption of free recombination, so that the DNA sequence is a collection of 
independent sites. Thus our per-site analysis leads directly to ML estimation of mutation 
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rate and selection strength. 

We begin by recasting the PRF expression for the steady state distribution of mutant 
hneages to describe the frequencies of mutant hneages at a single site. At a given site, we 
have 

f(x) = Os- — 5 r-^, (4) 

•'^ ' ' 1 - x{l-xy ^ ' 

where Og is the per-site value, Og — 2N^. Using this formula to describe multiple lineages 
at a single site is somewhat peculiar, because this result assumes that all mutant lineages 
behave independently of one another. Clearly this is not strictly true, since the mutant 
lineages are segregating at the same site. However, provided two mutant lineages rarely 
achieve simultaneous high frequencies in the population, then the assumption of independent 
mutant lineages is a good approximation. This assumption of non-interacting mutant linages 
will often hold even when the other aspects of the infinite-sites approximation are violated. 

Analogous to the original PRF method, at a single site the number of mutant lineages 
which are observed i times in a sample of n sequences ("i-fold mutant lineages") is Poisson 
distributed with mean 



1 1 _ e-27(i-a;) I 



n 



Based on this, we can calculate the probability of any particular polymorphism configuration 
at a site. 

We begin by describing this calculation in the unfolded case. The probability that a site 
is monomorphic is just the probability that no i-fold mutant lineages are found at that site, 
for all i between 1 and n — 1. This is 

Pmono = P(nAO,0) = B-^^'^e-^^^ . . . e'^^'^"^) ^ M. (6) 

The probability that a site exhibits a (n — 1, 1,0,0) configuration is the probability that a 
single 1-fold mutant lineages is sampled, and no 2-fold or higher lineages are found, 

P(n-1,1A0) = F{1)M. (7) 

The probability of exhibiting an (n — 2, 2, 0, 0) configuration is more complex. This configu- 
ration could arise from a single 2-fold sampled lineage (as assumed under the standard PRF 
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method) or it could arise from two 1-fold sampled mutant lineages which happen to involve 
mutations to the same nucleotide. Hence its probability is 

P(n-2,2,0,0) = + (8) 

where the factor of | is the probability that two mutations result in the same nucleotide. 
This expression assumes that mutations between all possible nucleotides are equally likely 
— the obvious generalization applies when there are mutational biases, which we do not 
discuss further. Similarly the probability of an (n — 2, 1, 1, 0) configuration is 

^(n-2,M,0) = -^^3- (9) 

The probabilities of more complex configurations can be calculated in a similar way. The 
probability of an {n — 4, 4, 0, 0) configuration, for example, is the probability of four 1- 
fold lineages to the same nucleotide plus the probability of two 1-fold hneages and a 2-fold 
lineages, plus the probability of two 2-fold lineages, plus the probability of one 1-fold lineage 
and one 3-fold lineage, plus the probability of a single 4-fold lineage. We have 



-P(n-4,4,0,0) = M 



Filf 1 FilfFi2) 1 F(2)2l , , , a 

+ ^ o, ^ + -ttAo + ^(1)^(3) o + m 



(10) 



4! 33 2! 32 2! 3 ' ' ' '3 

In general, the probability of a particular configuration is given by the sum of the probabil- 
ities of all possible partitions of n that lead to that configuration. 

In the folded case, these calculations become even more complex. The probability of 
a (12, 2, 0, 0) folded configuration, for example, includes the probability of a single 12-fold 
sampled lineage, as well as two 6-fold sampled lineages, and so on. Thousands of terms may 
arise in the expression for the probability of a particular configuration, even for moderate 
values of n. We do not quote any of these results here, but rather we have developed a 
computer program to output symbolic expressions for the probabilities of all possible folded 
as well as unfolded configurations, for a given sample size n (available on request). 

These probabilities of site configurations form the basis of maximum likelihood param- 
eter estimation. The probability of a data set with L total sites, including La^^^^d sites in 
configuration (a, 6, c, d) is given by 

^' \\{pt;:oT^ (11) 
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where the products are over all possible configurations. Given a particular data set, we 
maximize this probability over 9s and 7 to find the ML estimate of these parameters. In the 
original PRF method, this ML estimation is particularly simple, because the ML estimate for 
9 can be expressed analytically in terms of the ML estimate for 7, leaving a one-dimensional 
numerical maximization procedure to estimate 7. In our per-site PRF method, however, a 
full two-dimensional numerical maximization is required to find the ML estimates of 9s and 
7- 

Our per-site PRF method relaxes most of the consequences of the infinite-sites approxi- 
mation inherent in the original PRF estimation procedure. We allow for the possibility that 
multiple mutant lineages contribute to the polymorphism observed at a single site. Thus 
we avoid the systematic underestimation of 9 and 7, and incorrect inference of positive se- 
lection, that affect the traditional PRF (see Numerical Simulations). This new method also 
uses all of the data available in the sample, including the number of monomorphic sites and 
the differences between [n — 2, 2, 0, 0) and {n — 2, 1, 1, 0) sites. It does still retain one aspect 
of the infinite-sites approximation: it assumes that mutant lineages segregating at the same 
site are independent (i.e. they do not interfere with each other). This is never strictly 
true, but is a good approximation unless multiple mutant lineages reach high frequency at 
a given site at the same time. Note that because of this assumption, the probabilities of 
all possible configurations (a, 6, c, d) described above do not precisely sum to unity, because 
our approach allows a typically small but nonzero probability of multiple mutant processes 
adding to more than n sampled individuals. Since the no-interference approximation is al- 
ways valid when the infinite-sites approximation is, and it holds in many situations where 
the infinite-sites approximation fails, our revised sampling method extends the applicability 
of the PRF framework and fixes many of its problems. 

A PER-SITE DIFFUSION MODEL OF POLYMORPHISMS 

In this section, we describe a method that shifts fundamentally from the PRF framework. 
Rather than studying the distribution of the frequencies of mutant lineages, we focus on the 
evolutionary dynamics at each individual site, without keeping track of individual mutant 
lineages. We develop this into a maximum-likelihood estimation of 7 and 9 from poly- 
morphism data, which requires neither the infinite-sites or no-interference approximation 
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described above. As in the original PRF method, we assume free recombination between 
sites. 

At an individual site, we imagine that one nucleotide is preferred, and the other three 
have the same fitness disadvantage s (s < 0). We assume that mutations occur at rate 
fi, and hence at rate /i/3 between any two specific nucleotides (i.e. no mutational biases). 
These assumptions simplify the discussion, but are not essential. In fact, one advantage of 
this approach is that these assumptions can be easily relaxed with obvious generalizations 
(noted below). 

We can analyze the process of mutation, selection, and drift at a single site with a 
three-dimensional diffusion approximation, and calculate the joint steady-state probability 
distribution of the frequencies of the four possible nucleotides at the site. This then leads 
naturally to the likelihood of any configuration of polymorphism data at the site as a func- 
tion of 7 and 6s, and hence to a ML estimation of these parameters from data. Alternatively, 
we can sacrifice some of the information in the data, and reduce the computational com- 
plexity of the problem by treating all three disfavored nucleotides as a single class. Such a 
treatment reduces to a standard one- dimensional diffusion process whose steady state prob- 
ability distribution describes the frequency of the preferred nucleotide versus the sum of 
th e frequencies of the disfavored ones; this treatment is essentially a steady-state version 



of 



Williamson et al. 



( 120051 ). This approach discards some of the information in the data 
(e.g. not making use of the difference between (12,1,1,0) and (12,2,0,0) sites), but it is 
computationally simpler. We begin by describing the one- dimensional method, and then 
turn to the three-dimensional method. 



The One-Dimensional Diffusion Model 



We begin by describing a simplified diffusion approach that calculates the frequency 
distribution of favored versus disfavored nucleotides. As noted above, we will for simplicity 
assume that one nucleotide is preferred, and the other three nucleotides are disfavored. We 
denote the sum of the frequencies of the three disfavored alleles by x; the frequency of the 
preferred nucleotide is 1 — x. 

We assume that mutation, selection, and random drift occur at each site according to 
standard Wright-Fisher dynamics. Thus the probability distribution of x can be described 
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by the diffusion equation 

P^^' t'^^"^^^^"^' + ^ [^(^)/(^' ^)] ' (12) 

where f{x, t) is the probabiUty that the disfavored nucleotides sum up to frequency x at time 
t, s is the selection coefficient against the disfavored nucleotides (s < 0), is the per-site 
mutation rate per individual per generation, and 

m{x) = sx{l — x) + — x) — —X (13) 

3 

v(x) = (14) 
This diffusion equation is well-known and has the steady-state solution 

f{x) = Cx''-\1 - x)''»/3-lg27X^ (15) 

where C is a {6s and 7-dcpcndcnt) normalization factor, and as before 9s = IN 11 and 7 = Ns. 

If the frequency of disfavored nucleotides at a site equals x, the probability that we find 
i such nucleotides in a sample of n individuals is (")x*(l — x)""*. Averaging over x, the 
overall probability that we sample i disfavored nucleotides at a given site is 

F{i) = (^^ J^' Cx'^+'-\l - :,y./^+n-i-i^2r. (^g) 

This integral, including the calculation of the normalization factor C, can be solved analyt- 
ically. We find 

. .X _ (n\ r(n - i + es/3)T{i + 9s)iF,{z + ^„ n + 4g,/3, 2j) 

[tj r{es/3)T{9s)iF,{esA0s/^2^) ' ^ ^ 

where F is Euler's Gamma function and iFi is a hypergeometric function. 

The expression above leads immediately to a maximum likelihood method for estimating 
7 and 9s in the unfolded case - i.e. when the identity of the preferred nucleotide at each site is 
known. In a sample of n sequences each of length L, we count the number of sites at which i 
disfavored nucleotides are sampled, yi, for < i < n. Since all sites are assumed independent, 
each with the polymorphism frequency distribution described above, the likelihood of the 
data given the parameters is 

T\ " 
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For any set of polymorphism data, it is straightforward to maximize this function numeri- 
cally, producing ML estimates of 9s and 7. As with the per-site PRF method, this procedure 
involves a two-dimensional maximization routine. 

When we do not know which nucleotide is preferred at each site, we must use the "folded" 
version of the data. This presents difficulties. Imagine a site with an (a, 6, c, d) polymorphism 
configuration. We might naively suppose that since any of the four nucleotides could be 
the preferred one, the probability of this data is simply F{h -|- c -|- rf) -|- F{a + c + d) + 
F{a + b + d) + F{a + b + c). However, this is not the case. For example, if a is indeed 
the preferred nucleotide, then F{b + c + d) does not equal the probability that the three 
disfavored nucleotides will form a (6, c, d) configuration. Rather, it equals the sum of the 
probabilities that the three disfavored nucleotides will form a configuration k), summed 
over all i,j, k triplets that sum to b + c + d. This is a serious problem, because the difference 
between F{b + c + d) and the probability of the data depends on b + c + d, and hence it 
is not identical for all four possible prefe rred nucleotid e s. Sim ply assuming that the most 



common nucleotide is the preferred one (IHartl et al. 



199J) is a reasonable approach to 



folded fits. But this approach will be inaccurate for sites where the most common nucleotide 
is not overwhelmingly so; when this situation describes a substantial fraction of sites, the 
method will fail. As a result, the one- dimensional diffusion framework does not allow for a 
rigorous ML estimate of parameters with folded data. The exact same problem also arises 
in the traditional PRF method, although it tends to be obscured by the other problems with 
that method. 

In order to perform rigorous ML fits to folded frequency data we must turn to a three- 
dimensional diffusion method, which we will now discuss. 



The Three-Dimensional Diffusion Model 



Rather than considering all disfavored nucleotides as a single class, we can instead keep 
track of the evolutionary dynamics of all four possible nucleotides at a site. In order to 
do so, we assume the standard four-allele Wright-Fisher dynamics, with mutation at rate 
I between any two particular nucleotides, and selection acting with strength s against the 
three disfavored nucleotides. The dynamics can then be described by a three-dimensional 
diffusion equation for the joint distribution of the frequencies of the three disfavored alleles 
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xi,X2, and X3, f{xi,X2, x^, t) (where the preferred allele has frequency xq = 1 —xi —X2 — xs). 
We have 

-3, t) = ^ E E 7^ ■f]-i:ir.[M^-f], (19) 



where 



i=l 7=1 j=l 



= (1 + (^1 - a;, j + (20) 

= ^^il^ (21) 
V^, = -^-^ i^^J)■ (22) 



This is a somewhat less well-known diffusion equation (IWattersoni . 119771 : IWrightI . 1 19491 ) ; 

the steady-state solution is 

fix,, X2, X3) = C [x,X2X3il -X,-X2- X^)]^ ^2,(.,+^,+.,) ^ (23) 

where C is a normalization factor, and we have defined ( = 9^/?) — 1. 

Given the frequencies Xq, Xi, X2, and X3 of nucleotides in the population, the probability 
of sampling a site in an unordered configuration (no, ni,n2, n^) in a sample of n individuals 
(adopting the convention that the first nucleotide listed is the preferred one) is just the 
multinomial probability 

, ,• ■ , (1 -x,-X2- xsr^Tx^xT- (24) 

Averaging over /, we therefore find that the probability of sampling a site in an unordered 
(no, rii, ?T,2, n3) configuration is 

Pno,ni,n2,n3 = C / / / ^ (25) 



no,ni,n2,n3 i i i i / / / 

no!ni!ra2!n3! Jq Jq Jq 

g27(xi+X2+X3)^C+ni^C+n2^f+n3 _ _ _ ^^^C+no dx^dX2dXi. (26) 

In some applications, we will know which of the nucleotides is preferred at each site — 
i.e. the unfolded case. In this case, the probability of finding a site in an ordered unfolded 
configuration (a, 6, c, d) (where by convention a is the number of individuals which have the 
preferred nucleotide and 6 > c > d), is 

Pa,b,c,d ~ ^ ] Pno,ni,n2,n3j (27) 

{no, ni, 722,713} 
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where the sum is over all unordered configurations {no, ni,n2, ris) that give rise to the ordered 
unfolded configuration (a, b, c, d). 

In other cases, we do not know which of the nucleotides is preferred at each site - i.e. the 
folded case. Here, the probability of sampling a site in the ordered configuration (a, b, c, d) 
is just the sum of the probabilities assuming that each of the four possible nucleotides is 
preferred. As before, we adopt the convention that ordered folded configurations are written 
as (a, 6, c, d) with a > b > c> d. The probability of a folded configuration Pabcd 

^a,b,c,d ~ Pno,ni,n2,n3J (28) 

{no,ni,n2,n3} 

where in this case the sum is over all unordered configurations that give rise to the ordered 
folded configuration (a, b, c, d) . 

For either folded or unfolded data, given n samples of a sequence L sites long, with Lafi,c,d 
sites in an (a, 6, c, d) polymorphism configuration, the likelihood of the data is 

LiOs, 7) = fTT^ n , (29) 

1 i ^a,b,c,d- 

where the products are taken over all possible configurations (a, b, c, d) and Pa,h,c,d is the 
folded or unfolded probability defined above. We can numerically maximize this function to 
find the ML estimates of Og and 7. 

In practice, the ML estimation procedure described above is difficult to implement, be- 
cause of the triple integral in the definition of Pno,ni,n2,n3 (as well as that implicit in the 
definition of the normalization constant C). This integral cannot be solved exactly, and it 
is difficult to evaluate numerically because the integrand may diverge (though the integral 
itself converges) near the boundary of the simplex over which it is integrated. We adopt 
a hybrid method to simplify the evaluation of this triple integral. Near the boundary of 
the simplex, we Taylor expand the integrand and integrate it analytically. Away from the 
boundary, the integrand is well-behaved and standard numerical integration has no difficul- 
ties. This approach is most easily achieved by making the substitutions y — x^/i^ — Xi— X2) 
and z — ^2/(1 — xi). On doing so, we can rewrite the integral as 

Pno,ni,n2,n3 — C I I I ^ (^'^) 



nolnilnalng! Jo Jo Jo 
exp [2^ {x + y + z - xy - zy - xz + xyz)] x 
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This expression is much easier to handle than our original expression, because the three 
integrals can be done in arbitrary order. We divide each of the three integrals into three 
pieces: one from to S, one from 6 to 1 — 5, and one from 1 — 5 to 1. Thus the triple integral 
is split into 27 total terms. For each of the integrals from to 5 or 5 to 1 — 5, we Taylor 
expand the integrand in the integration variable, and we solve the integral analytically. All 
of the remaining integrals, from 6 to 1 — 6, are done numerically. We must choose 6 large 
enough that we can perform the numerical integrals quickly, but not so large that the Taylor 
expansions used for the analytical parts become invalid. For these Taylor expansions, we 
need 6 <^ 6 <^ 1, 6 <^ 3C+2' ^ ^ 2C+T' S <^ ^. For the computational analysis 
described in this paper, we choose whichever of these conditions is most restrictive and 
set 6 to be one-tenth of the most restrictive requirement. We find that this choice of S 
is sufficiently small to provide accuracy in the analytical parts of the integrals, but large 
enough to enable quick numerical integration on the interior of the simplex. 



Comparison between the PRF and diffusion methods 

Both our one- and three-dimensional diffusion approaches relax all of the infinite-sites 
assumptions of the PRF framework. This includes the interference between mutant lineages 
segregating at the same site, which even the per-site PRF method mishandles. Thus, the 
diffusion approach contains none of the biases associated with infinite-sites approximation 
that plague the traditional PRF and, to a lesser degree, the per-site PRF. The diffusion 
approach also provides a clear and concrete model of the evolution at each site. This 
contrasts with the PRF method, which makes the unrealistic assumption that if a deleterious 
allele fixes at a site, further mutations are again deleterious. 

The diffusion method is also easily extendable to more complex evolutionary situations. 
For example, we can explore different selective costs for different nucleotides, more than one 
preferred nucleotide, or mutational biases. These possibilities lead to obvious modifications 
of the diffusion equations and their solutions, and hence to the maximum likelihood estima- 
tion. It is also straightforward to investigate balancing selection or the effects of dominance: 
these lead to well-unders tood modifications to the diffusion equations and their steady state 



solutions ( jEwENsLl2004l ). In the PRF framework, by contrast, such generalizations are much 



more complex. In particular, balancing selection is impossible to analyze under the PRF 
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framework, because it leads to mutant lineages reaching stable intermediate frequencies in 
the population. As a result, the generation of new mutations is not balanced by the extinc- 
tion or fixation of older ones, and hence no steady state distribution of lineage frequencies 
exists. 

However, the diffusion approach is not without drawbacks. The one-dimensional version 
does not make use of all of the data available in the observed polymorphism spectrum. 
The three-dimensional version does use all the data, but its implementation requires tedious 
numerical integration routines. 

The diffusion method also cannot naturally handle positive selection. The steady state 
evolutionary dynamics at a site are always dominated by the preferred nucleotide (or 
nucleotides), with negative selection acting against polymorphisms for the disfavored nu- 
cleotides. At the level of an individual site, positive selection is a process that is intrinsically 
out of steady state: the spread of a favorable nucleotide before it becomes fixed. The PRF 
method handles this by positing that there are a wide array of sites which have the poten- 
tial for positive selection, and assuming that a steady state across sites of these selective 
sweeps is maintained. It should be possible to address positive selection within the diffusion 
framework by changing the boundary conditions of our diffusion equations, so that the fix- 
ation of one mutant lineage shifts the selective landscape so that further mutations are now 
favored. Formally, any probability flowing into x = 1 is "absorbed" and moved to x = 0. 
Alternatively, one could use full time-dependent solutions to the diffusion equations to study 
positive selection. However, we do not pursue these approaches in this paper, and instead 
focus our study on the case of negative selection. 

VARIABLE SELECTION PRESSURES ACROSS SITES 

Both the original PRF method as well as the three per-site methods we have proposed in 
this paper assume that all sites experience the same selective p ressure. In rea l ity, w e expect 



that there is some distribution of selective pressures across sites. IHartl et al\ ( 1l994j ) suggest 
that in this case the ML estimate of 7 from their PRF method reflects an "average" selection 
pressure across the sites. In fact, the ML 7 reflects a weighted average of the actual 7's across 
the sites, but the nature of this weighting is not understood. Almost no weight is given to 
sites at which I7I ^ 1, because these sites will almost always be monomorphic and hence 
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ignored by the original PRF method. It is unclear how sites with different values of 7 of 
order 1 will be weighted, or how the presence of some effectively neutral sites (I7I <^ 1) will 
change the ML estimate. 

The issue of variable 7 across sites is of even greater concern for the methods we have 
proposed, because these methods make use of the monomorphism data. If some number L; 
sites are effectively lethal (i.e. I7I ^ 1), these sites will increase the number of monomorphic 
sites, which will tend to depress our ML estimate of Og and increase our estimate of |7|. 
Fortunately, our methods are able to use the monomorphism data in order to investigate the 
number of lethal sites, L^, or more generally to infer a full distribution of selection pressures 
across sites. 

Since all of the methods we have proposed are defined at a pcr-sitc level, it is straight- 
forward to assume that there are multiple different classes of sites with different values of 7. 
We can posit that there are k classes of sites. Each class is represented by of L total sites, 
and has its own value of 7, which we call 7^. The probability that a site is in an (a, 6, c, d) 
configuration is then 

k 

PaAc4-Y.j;Pi,,c,ai^M, (31) 
i=i 

where Piijcdilh^s) is the probability that site with parameters 7^ and 9s is in the configu- 
ration (a, 6, c, d). This expression is correct both for the per-site PRF and per-site diffusion 
approaches — so we can apply this method regardless of which approach we are using. 

Given our new definition of Pa,b,c,d, we can construct the folded or unfolded likelihood 
of the overall polymorphism data set in exactly the same way as before. This likehhood 
function now depends on 2fc -|- 1 parameters: 9s, the 7^, and the Lj. We can find ML 
estimates of all of these parameters using a multidimensional numerical maximization of the 
likelihood function. By choosing k, we determine the resolution at which we measure the 
distribution of values of 7 across sites. Naturally, the larger the k we choose, and hence the 
greater the resolution on 7, the more data we require to obtain accurate estimates of the 
individual Lj and 7^. 

Rather than estimating both the Lj and the jj, we could instead posit that there are 

several classes of mutations with different pre- specified 7^-, and estimate only the values of Lj. 
In other words, we ask what fraction of sites have different values of selective constraints. We 
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describe here one particularly important example of a hybrid between these two procedures, 
with two classes of sites (A; = 2). Rather than fitting an ML estimate of 7 to both classes, 
we assume that one class of sites is unable to evolve: mutations at these sites are lethal 
(more precisely, they have I7I ^ 1). We wish to calculate the number of lethal sites, and the 
average selective pressure on the remaining, non-lethal sites. Thus we have three parameters: 
the mutation rate 6*5, the number of lethal sites L2, and the strength of selection 7 acting at 
the other L1 — L — L2 sites. The probability that a site is monomorphic is given by 

_L2 Lipi 

^ mono ~ ]^ ]^ mono- \^^J 

Here Pmono probability that a site with strength of selection 7 and mutation rate 6s 

will be monomorphic, as defined by either the per-site PRF or per-site diffusion approach 
(whichever method we are using). The probability that a site is in a non- monomorphic 
(a, b, c, d) configuration is 

Pa,b,c,d — -J^Pl,h,c4- ("^"^) 

Now we can write the likelihood of the data in the usual way. This results in a three- 
dimensional ML problem. However, we can simplify the problem by first maximizing L2 
given 7 and 6. We find that the ML estimate of L2 is 

T _ r pi 

^mono ^^mono /oA\ 
^2 - 7—Bl ' i-^^j 

mono 

where Lmono is the number of monomorphic sites in the data and Pmono ^^e probability a 
non-lethal (i.e. an Li) site is monomorphic. Substituting this value for L2, we arc left with 
a two-dimensional maximization problem in 7 and 9s, similar to the original situation. 

It is worth exploring how this procedure for estimating the number of lethal sites utilizes 
the data. As we now show, this procedure is equivalent to ignoring the monomorphism data 
when finding the maximum-likelihood estimates of 7 and 9s for the non-lethal sites. The 
likelihood of the data ignoring monomorphic sites is 



1 i ^a,b,c,d 



a,b,c,d 



1 - Pl 

mono. 



(35) 



where Lp is the total number of non-monomorphic sites and the products are over all con- 
figurations of non-monomorphic sites. After finding ML estimates of 7 and 9 from this 
monomorphism-ignoring likelihood function, the procedure then calculates the number of 
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monomorphic sites that would be expected given 7 and 6s- This is LiP^^^ = {L — L2)P^ono- 
We then estimate the number of lethal sites L2 as the difference between the observed num- 
ber of monomorphic sites Lmono and the number that would be predicted if all sites were of 
the Li variety, L2 = Lmono ^ {L — L2)P^ono- Rearranging this expression, we see that it is 
identical to Eq. above. And indeed, plugging Eq. (152]) . Eq. (1551) . and Eq. (I51|) into 
Eq. ([31]) yields Eq. (l35]). 

Thus, this procedure ignores the monomorphism data when calculating 7 and 6s (at 
the non- lethal sites), and it instead uses the monomorphism data to infer one aspect of the 
distribution of 7 across sites — specifically, the number of lethal sites. Since the original PRE 
method also ignores monomorphism data, we obtain this information on the distribution of 
7 for "free," relative to the power of the original method, simply by shifting to the per-site 
model. If desired, we can also posit that there are a number of sites L3 which are effectively 
neutral (i.e. with I7I <^ 1), and estimate L3. This would devote some part of the data 
describing polymorphisms at intermediate frequencies to estimating L3, though the division 
in which data are used for estimating which parameters is not as sharp as in the case of 
lethal mutations. From this procedure we could estimate the number of effectively lethal 
sites, the number of effectively neutral ones, and the "weighted average" selection pressure 
acting on the remaining sites. If more resolution is desired, and enough data are available, 
we can increase the number of classes of sites and obtain ML estimates of the numbers of 
sites in each class and the selection pressure acting on each class. 



ACCURACY OF INFERENCE TECHNIQUES 

Using data generated from Wright-Fisher simulations, we have tested the inferential ac- 
curacy of the PRE method as well as the accuracy of our three alternative methods. The 
Wright-Eisher model (or, more precisely, its diffusion limit) forms the basis of the PRE 
method, and it is therefore the appropriate simulation framework for testing the method. 

All simulations assumed a constant population of = 1000 haploid individuals. Each 
of L = 1000 sites, simulated independently, could assume one of four states: a, c, t, or g. 
One state is assigned fitness 1, and the other three states fitness 1 + s. Mutations occurred 
at rate fi per s ite. The allele frequencies evolved according to the standard Wright-Eisher 



Markov chain (IEwensI . 12004] ). Each simulation was run for at least 10//i generations, so as 
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to ensure relaxation to steady-state. At the end of the simulation, n = 14 individuals were 
sampled from the population and the polymorphism frequency spectrum was recorded. For 
'unfolded' fits, the identity of the preferred nucleotide was retained, whereas this information 
was discarded for 'folded' fits. We cho s e to c onsider samples of size = 14 in order to 
facilitate comparison with lHARTL et all (119941 ). 



We performed simulations over a wide range of parameter values. We considered five 
different values of 9s: 0.05, 0.1, 0.5, 1.0, and 5.0. For each value of 9s, we performed one 
simulation at each of 17 different values of 7, ranging from 7 = —10.0 to 7 = —0.1. For 
each set of simulation parameters (7,^^), once the simulated polymorphism data had been 
generated, ML parameter estimates (7, 9s) were obtained by numerical maximization of the 
likelihood function, as specified by the original PRF model, the per-site PRF model, the one- 
dimensional diffusion model, or the three -dimensional diffusion mod el. A 95% confidence 

BusTAMANTE et all ( 120011 ): the interval includes 



interval for 7 was constructed according to 
those values of 7 within 0.5xio95 likelihood units from 7. The estimated parameters shown 
in Figures 1-5 are somewhat 'jagged,' because the inference methods have been applied to 
a single draw of n = 14 sequences for each set of simulation parameters, as opposed to 
averaging over many such draws. 

As discussed above, the original PRF model disallows multiple mutant lineages at a 
site. Therefore, when a site sampled from the simulated data exhibited more than two 
types of segregating nucleotides, the frequencies of all unpreferred nucleotides we r e sum med 



to represent the frequency of the 'mutant' type, as suggested by IHartl et al\ ( 1l994j ). In 
addition, when fitting folded data using the PRF m ethod, the mo s t coni mon nucleotide was 
assumed to be the ancestral type, as suggested by Hartl et al. ( jl994 ). This approach is 
not entirely accurate, as discussed above, but it is probably the best option available within 
the original PRF framework. 

Fig. 2 compares the accuracy of estimated selection pressures using the original PRF 
method versus the the one-dimensional diffusion method. Fig. 3 shows the same type 
of comparisons over a range of mutation rates, including also the modified, per-site PRF 
method. The original PRF method systematically underestimates the strength of negative 
selection, by as much as a factor of 10. In fact, the PRF method strongly rejects the true 
parameters in over 85% of the cases. In addition, the PRF method often erroneously infers 
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strong positive selection when in fact mutants are under negative selection. These problems 
are more severe when the mutation rate is large, but also occur for small mutation rates. 



The smallest mutation rate s^ 



bacterial genes (IHartl et al. 



lown in Fig. 3 is one-half the mutation rate estimated for 



I994J ). The per-site version of the PRF method that we 



have developed corrects the most severe problems of the standard PRF method, but it too 
exhibits systematic biases, especially when selection is weak and the mutation rate large 
(Fig. 3). The one-dimensional diffusion method that we have developed provides accurate 
and unbiased estimates of 7 over the full range of selective pressures and mutation rates 
(Fig. 3). Like its one-dimensional counterpart, the three-dimensional diffusion method also 
provides accurate and unbiased estimates over the full range of simulated parameters (not 
shown) . 

For folded data. Fig. 4 shows the accuracy of inferred selection pressures using the original 
PRF method, the per-site PRF method, and the three-dimensional diffusion method. Again, 
the original PRF method systematically underestimates the strength of selection, and it also 
erroneously infers positive selection. As before, the per-site PRF method corrects the most 
severe problems of the original PRF method, but it exhibits systematic biases at large 
mutation rates (Fig. 4). The three-dimensional diffusion method that we have developed 
provides unbiased estimates of selection pressures. When selection is weak {i.e. I7I < 1), 
however, the confidence intervals on diffusion-based estimates of 7 are appreciably larger 
in the folded case, compared to the unfolded case (Fig. 3c versus Fig. 4c). This behavior 
makes perfect sense: when selection is nearly neutral and the ancestor state is unknown, the 
frequency distribution does not exhibit sufficient skew to deduce the preferred nucleotide. As 
a result, the diffusion-based estimator cannot distinguish between weak positive and weak 
negative selection in the absence of information on the preferred nucleotide (Fig. 4c). Thus, 
the confidence intervals obtained under the folded diffusion technique properly reflect our 
inability to estimate the selection pressure precisely when selection is weak. 

As shown in Figures 3 and 4, when selection is weakly negative the original PRF method 
erroneously infers positive selection, regardless of the mutation rate. In fact, this problem 
occurs in the exact parameter regimes that have been estimated from biological data. For 



example, on the basis of n = 14 sampled sequences each 367-sites long, IHartl et al\ (Il994j ) 



estimated 7 = —1.34 and 9^ = 0.0915 for silent sites in a bacterial gene. If we simulate 367 
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Wright- Fisher sites under these parameters and sample n — lA sequences, we find that the 
most hkely parameters fit using the original PRF method are (7, 9s) — (+18.45, 0.067). This 
exercise demonstrates that the PRF method not only infers the wrong sign of selection, but 
it is also an inconsistent estimator under biologically realistic parameters. 

Fig. 5 shows the accuracy of estimated mutation rates using the original PRF method, 
the per-site PRF method, and the diffusion methods. The original PRF method systemati- 
cally underestimates the mutation rate, by a factor as large as 30. In the unfolded case, the 
tendency to underestimate the mutation rate is stronger when selection is weak. Estimates 
obtained using the per-site PRF method that we have developed partly correct these prob- 
lems, but still exhibit biases at large mutation rates. Our diffusion-based methods provide 
accurate and unbiased estimates of the mutation rate, for both folded and unfolded data, 
across the full range of mutation rates and selection pressures (Fig. 5). 

The methods we have developed in this paper also allow us estimate the distribution of 
selection pressures across sites. In one simple case discussed above, we have presented a pro- 
cedure for estimating the number of lethal sites and the selective pressure operating on the 
remaining, non-lethal sites in a gene. This procedure involves estimating 7 and 6 on the basis 
of polymorphic sites alone, and thereafter estimating the proportion of observed monomor- 
phic sites that are lethal. In order to assess the power and accuracy of this approach. Table 
1 shows the predicted number of monomorphic sites in each of our simulations, compared 
to the number of monomorphic sites actually observed. Across a large range of selective 
pressures and mutation rates, this approach typically estimates the number of (non-lethal) 
monomorphic sites within a few percent. As a result, for a gene of length L = 2000 sites, 
one-half of which are lethal, our procedure will accurately predict the number of lethal sites 
within a few percent; and it will accurately predict the selection pressure on the remaining, 
non-lethal sites. 

The simulations and fits presented in this section refiect our intuitive understanding of 
the assumptions underlying the PRF model versus the methods we have developed. For 
example, the PRF method disallows multiple mutant lineages at a site and should therefore 
lead to systematic underestimates of the mutation rate and selective strength, even when 
mutations arc segregating at low frequency. Our simulations and fits verify this behavior. 
The per-site PRF method corrects the most problematic aspects of infinite-sites assumption. 
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but it still leads to biased inferences at large mutation rates because it neglects interactions 
among high-frequency mutant lineages. Finally, the one- and three-dimensional diffusion 
models avoid the infinite-sites approximation altogether, as reflected by the accuracy of 
estimates obtained across the full range of parameters. 



DISCUSSION 



The Poisson Random Field model (ISawyer and HartlI. 119921 ) and the associated likeli- 



hood procedure for estimating parameters (IHartl et al 



19941 ) are perfectly valid when the 



assumptions underlying the method are met — namely, infinite sites, free recombination, 
and constant selective pressure across sites. We have shown, however, that the PRF method 
leads to severely incorrect inferences in practice, because in reality genes contain a finite 
number of sites. We have developed three new methods that relax or remove the infinite- 
sites assumption. These new methods not only fix the problems associated with the PRF 
method, but they also extend the types of inferences that can be drawn from polymorphism 
data to include inferences on the distribution of selective pressures across sites. 

It may seem surprising that the infinite-sites approximation can lead to such drastic 
errors in the mutation rates and selective strengths inferred by the PRF method. After all, 
sites at which multiple mutant lineages are sampled are presumably very rare. Multiple 
polymorphisms (i.e. three or more segregating nucleotides) occ ur in only a f e w per cent of 



sites for much of the data analyzed by the PRF method (e.g. IHartl et al\ (119941 )). Yet 
despite their rarity, these sites have a large impact on maximum likelihood estimation of 7. 
Because 7 enters the likelihood function in the factor e^'^^, changing 7 has a much larger 
impact on the likelihood of sites with many mutant nucleotides than those with few. In other 
words, a single site with a high frequency of mutant nucleotides is very strong evidence for 
positive selection or low I7I, whereas a site with a low frequency of mutant nucleotides is not 
very strong evidence for the opposite. Thus even a very few sites at which multiple mutant 
lineages are sampled, but are incorrectly assumed to be the result of a single high-frequency 
mutant lineage by the original PRF method, can cause large inaccuracies in the inferred 7. 
These inaccuracies in 7 then force corresponding inaccuracies in the inferred 9s- 

Previous simulation studies have not observed these problems with the PRF method, 
and they appear to show that the PRF method makes accurate inferences of mutation rates 
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and selection pressures ( IBustamante et a/.l . l200ll ). However, these simulations themselves 
implicitly assume infinite sites, and hence they cannot be used to test this aspect of the PRF 
method. In this paper, we have simulated a finite number of sites that evolve according to the 
Wright-Fisher model that forms the basis of the PRF derivation. As our simulations and fits 
demonstrate, the infinite-sites assumption causes the original PRF method to systematically 
underestimate selective pressures and mutation rates, and to find positive selection where 
it does not in fact exist. The techniques developed here, by contrast, produce accurate 
estimates of mutation rates and selection pressures across a broad range of biologically 
reasonable parameters. 

The fact that a few highly polymorphic sites have a large impact on the inferred values of 
7 and points to another important aspect of the original PRF method. In assuming that 
all sites have the same value of 7, the PRF method infers some sort of "weighted average" 
of the variable selection strengths across sites. This is not necessarily a bad thing — in the 
face of limited data, inferring the full distribution of selection strengths is impossible, and 
we want instead to have a rough sense of the average strength and direction of selection. 
However, the exact nature of this "weighting" is unclear. Since a single highly polymorphic 
site has a much larger effect in reducing the inferred value of I7I (or in suggesting positive 
selection) than a site with a low frequency of mutant nucleotides has in doing the reverse, 
the weighting clearly emphasizes neutral, nearly neutral, or positively selected sites more 
than deleterious sites. In particular, since the PRF method ignores monomorphic sites, it 
does not weight sites at which mutations are lethal. Thus, the weighting of the original PRF 
method is useful for increasing the sensitivity of the method to detecting positive selection, 
but the details of how this weighting works are poorly understood despite being crucial for 
understanding what an inference of positive selection really means. 

The methods we have developed allow us to relax the assumption that all sites have 
the same 7, and instead infer aspects of the distribution of selection pressures across sites. 
It is not yet clear how much data is required to provide adequate power for inferring this 
distribution to a given resolution. However, we can hope to gain a great deal of insight with 
only a few additional parameters — say, the number of sites which are neutral, lethal, and 
negatively selected, and the "weighted average" selection pressure on the latter class. As we 
have shown, we can estimate the number of lethal sites with no reduction in power relative 
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to the original PRF method, so this proposal would only involve one additional parameter. 
Additional classes of sites would involve one or two extra parameters each, depending on 
whether we specify or infer the selection pressure operating on these sites. The appropriate 
choice of resolution will depend on the context and quality of the data. We hope that 
experience gained from these approaches will also shed light on the nature of the weighting 
when a single "average" selection pressure is inferred from the same data, and allow us to 
interpret this number more precisely. 

Unlike the original PRF method, our two diffusion methods do not offer an easy way to 
infer the existence of positive selection. This limitation arises because positive selection is 
inherently out of equilibrium. The original PRF method and our modified, per-site PRF 
method handle this issue by positing rather strange dynamics at individual sites. All mutant 
lineages are assumed to be positively selected — so if a mutant nucleotide fixes at a given 
site, mutations back to the ancestral nucleotide are again assumed to be positively selected. 
While this assumption makes little sense at a per-site level, it allows us to obtain a steady 
state across sites, provided positive selection is ongoing and not saturated. We could modify 
our diffusion methods to mimic the PRF treatment of positive selection by changing the 
boundary conditions in our diffusion equations. Specifically, we would assume that prob- 
ability flowing into x = 1 (i.e. fixation of a mutant nucleotide) is absorbed and moved to 
X = (i.e. "reset" so that new mutations will again be favored). This diffusion equation 
can be solved exactly, and the solution used as a basis for inferring positive selection using 
the the per-site diffusion methods we have developed. 

Ideally, however, we would like to infer positive selection in the context of a realistic and 
well-defined model of the dynamics at individual sites. Such an approach would necessar- 
ily involve solutions fo r the transient dynamics of positively selected mutations sweeping 
through a population. IKimuraI (119551 ) found the full time- dependent solution for the dif- 
fusion equation describing the dynamics of a positively selected allele. Such processes are 
initiated as mutations arise, at Poisson-distributed times. Thus we could construct an ex- 
pected frequency distribution across sites consisting of a superposition of time-dependent 
solutions at each site, and use this as the basis for inference of positive selection. In this 
paper, we do not pursue either of these methods for detecting positive selection within the 
per-site diffusion framework, but this remains an important direction for future work. 
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Regardless of the methodology, it will always be difficult to discriminate positive selection 
from negative selection on the basis of the polymorphism frequency spectrum alone, partic- 
ularly when only folded data are available. Whether selection is positive or negative, mutant 
lineages drift nearly neutrally when their frequency is between and Positively selected 
lineages then fix relatively quickly once their frequency becomes substantially larger than 
while negatively selected lineages rarely ever reach frequencies greater than Thus 
from the point of view of the polymorphism frequency spectrum, positive selection is similar 
to random drift on [0, 1/I7I], with the upper bound a roughly absorbing boundary condition. 
Negative selection, on the other hand, is also similar to random drift on [0, 1/I7I], but with 
the upper bound a roughly refiecting boundary condition. Although this is relatively crude 
— selection does in fact have some impact on low-frequency lineages, and the boundary 
conditions are not exactly absorbing or refiecting — it indicates that the polymorphism fre- 
quency spectrum is roughly similar for negative and positive selection at the same |7|. Thus 
power to distinguish positive from negative selection based on the polymorphism frequency 
spectrum, especially with folded data, will always be relatively limited, regardless of the 
method used. 

Inference methods that utilize both interspecies divergence as well as intraspecies poly- 



morphism at synonymous anc 



( IMcDoNALD and Kreitman 



nonsynonymous sites (i.e. McDonald-Kreitman type tests 



I991I )) are often superior to those that rely on the full in- 



traspecific polymorphism frequency spectrum. While these methods lose some power by 
ignoring information about the full polymorphism frequency spectrum, they utilize data 
from more than one species as well as a key biological assumption - that synonymous sites 
are neutral - to provide a sort of internal control. As a result, su ch methods ar e typi 



cally less sensitive to many of the assumptions of the PRF model (ILoewe et al 



2006 



Sawyer and Hartl 



I992I ). including the infinite-sites assumption. Compared to the orig- 



inal PRF method, violation of the infinite-sites assumption causes less severe errors under 
McDonald-Kreitman type inferences, because highly polymorphic sites no longer have a 
disproportionate impact on inferred parameters. The quantification of such biases and the 
development of a finite-site framework for McDonald-Kreitman type inferences remain topics 
for future research. 
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Selection pressure y 

FIG. 1 Maximum-likelihood estimates of selection pressures obtained under the PRF method for 
unfolded polymorphism data. The figure shows the estimated selection pressure 7, on the y-axis, 
obtained by applying the PRF method to n = 14 sequences sampled from a simulated Wright- 
Fisher population (see Accuracy of Inference Techniques). Dashed lines indicate the 95% confidence 
interval around the ML estimate. The true selection pressure 7 used in the simulations is shown on 
the X-axis. The line 7 = 7 (i.e. a perfect prediction) is shown in red, and the line 7 = is shown 
in gray. The mutation rate is = 0.5 per site. The PRF method systematically underestimates 
the strength of selection, and it often leads to erroneous inferences of strong positive selection 
when selection is negative. In all cases, the PRF method strongly rejects the true parameters (the 
difference in the natural log likelihood between 7 and 7 ranges from 14 to over 700). 
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FIG. 2 A comparison between the accuracy of inferred selection pressures under the PRF method 
(top, as in Fig. 1) versus the one-dimensional diffusion method (bottom). In both cases, selection 
pressures were estimated from the unfolded polymorphism frequencies among n = 14 sequences 
sampled from a simulated Wright-Fisher population. Dashed lines indicate 95% confidence intervals 
around the ML estimates. The line 7 = 7 is shown in red, and the line 7 = is shown in gray. 
Unlike the PRF method, the diffusion method provides unbiased estimates of the selection pressure, 
and it does not lead to erroneous inferences of positive selection. 
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FIG. 3 Maximum-likelihood estimates of selection pressm'es obtained under the PRF method, the 
modified (per-site) PRF method, and the one-dimensional difi'usion method. Selection pressures 
were estimated from the unfolded polymorphism frequencies among n = 14 sequences sampled 
from a simulated Wright-Fisher population. In each panel, the simulated selection pressure 7 is 
shown on the j;-axis, and the estimated selection pressure 7 on the y-axis. Dashed lines indicate 
95% confidence intervals around the ML estimates. The line 7 = 7 is shown in red, and the line 
7 = is shown in gray. Simulations and fits were performed across a range of mutation rates, 
shown in separate rows. The diffusion method and, to a lesser extent, the modified PRF method 
correct the biases inherent in the original PRF method, especially at large mutation rates. 
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FIG. 4 Maximum-likelihood estimates of selection pressures obtained under the PRF method, the 
modified (per-site) PRF method, and the three-dimensional diffusion method. Selection pressures 
were estimated from the folded polymorphism frequencies among n = 14 sequences sampled from 
a simulated Wright-Fisher population. The diffusion method and, to a lesser extent, the modified 
PRF method correct the biases inherent in the original PRF method. When selection is weakly 
negative, the diffusion method cannot reject positive selection on the basis of folded data, as 
indicated by the lack of the upper confidence interval for some of the fits. 
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Mutation rate 6 

FIG. 5 Maximum-likelihood estimates of the mutation rate, Og = '^Nfi, obtained under the PRF 
method (blue), the modified PRF method (green), and the diffusion methods (red). Mutation rates 
were estimated from the unfolded (top) and folded (bottom) polymorphism frequencies among 
n = 14 sequences sampled from a simulated Wright-Fisher population. The simulated mutation 
rate is shown on the x-axis, and the estimated mutation rates 9s on the y-axis. The line Os = Og 
is shown in black. For each value of Og, simulations and fits are shown for 17 different values of 
7, ranging from 7 = —10.0 to 7 = —0.1. The PRF method systematically underestimates the 
mutation rate, especially when selection is weak. The diffusion methods provide accurate and 
unbiased estimates of the mutation rate for both folded and unfolded data, across the full range of 
parameters. 
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Accuracy of Predicted Number 


of Monomorphic 


Sites 


Actual 6s 


Average Error 


Median Error 


0.05 


7.6% 


7.8% 


0.1 


3.3% 


1.8% 


0.5 


2.3% 


1.1% 


1.0 


1.4% 


0.5% 


5.0 


0.1% 


0.1% 



TABLE I Accuracy of estimates for the expected number of monomorphic sites. We calculated 
maximum-likelihood estimates of 7 and 6s using the onc-dimcnsional diffusion method applied 
to unfolded simulated data excluding monomorphic sites. From these values, we calculated the 
expected number of monomorphic sites. Shown are the differences between the observed and 
expected number of monomorphic sites (in a simulated gene of length L = 1000 sites). For each 
value of 6s, we show both the average and median differences in numbers of monomorphic sites, 
across simulations with 7 ranging from —0.1 to —10. These results imply that we can accurately 
estimate the number of sites monomorphic due to drift. Thus, if a gene contains a similar or larger 
number of lethal site, we can also estimate the number of such lethal sites to within the above 
accuracy. 
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